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Abstract — We consider covarlance estimation in the multivari- 
ate generalized Gaussian distribution (MGGD) and elliptically 
symmetric (ES) distribution. The maximum likelihood optimiza- 
tion associated with this problem is non-convex, yet it has been 
proved that its global solution can be often computed via simple 
fixed point iterations. Our first contribution is a new analysis 
of this likelihood based on geodesic convexity that requires 
weaker assumptions. Our second contribution is a generalized 
framework for structured covariance estimation under sparsity 
constraints. We show that the optimizations can be formulated 
as convex minimization as long the MGGD shape parameter 
is larger than half and the sparsity pattern is chordal. These 
include, for example, maximum likelihood estimation of banded 
inverse covariances in multivariate Laplace distributions, which 
are associated with time varying autoregressive processes. 

Index Terms — Multivariate generalized Gaussian distribution, 
geodesic convexity, graphical models, Cholesky decomposition. 



I. Introduction 

Covariance estimation is a fundamental problem in mul- 
tivariate statistics. Many techniques for hypothesis testing, 
inference, denoising and prediction rely on accurate estimation 
of the true covariance. The problem is challenging when the 
available data is high dimensional and non-Gaussian. Such 
settings are typical in many applications including speech, 
radar, wireless communication, finance and more. These led 
to a growing interest in both robust and structured covariance 
estimation. Specifically, in this paper, we consider maximum 
likelihood estimation (MLE) in the multivariate generalized 
Gaussian distribution, with and without sparsity constraints 
on the inverse covariance. 

The first part of this paper considers the geodesic convexity 
in MLE in elliptically symmetric (ES) distributions. Methods 
for robust covariance estimation date back to the early works 
of |18|, |31|. A popular approach is Tyler's scatter estimate. 
It involves a non-convex optimization yet can be solved via 
simple fixed point iteration. It has been rigorously analyzed 
and successfully applied to different problems lISTI . Il26l . 
Recently, it was shown that the result is in fact the solution 
to a geodesically convex minimization ||6l, ifTOl . ["321, f33l, 
||36|. Geodesic convexity ensures that any local minima is 
also globally optimal and leads to a much simpler analysis 
El l. It also allows for numerous extensions, e.g., regularized 
solutions. In a competing line of works, a different class of 
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robust covariance estimation techniques was proposed based 
on the MGGD El, lHH. A well known example of MGGD 
is the multivariate Laplace distribution lfT4l . Fixed point iter- 
ations for MGGD estimation and their analyses has recently 
been considered in |24|, 19], |25j. The first contribution in this 
paper is a new analysis which shows that the negative log- 
likelihood in MGGD is also geodesically convex. This result 
requires weaker conditions than previous analyses, provides 
more intuition and paves the road to numerous generalizations. 

The second part of this paper addresses structured covari- 
ance estimation. Structure exploitation is a main ingredient 
in modem statistics that allows accurate high dimensional 
estimation via a small number of samples. A promising 
approach is based on sparse inverse covariance models. In 
the multivariate Gaussian case these are known as graphical 
models and characterize conditional independence ITSl . Il20l . 
El, H. These models have been successfully applied to 
speech recognition, sensor networks, computer networks and 
other fields in signal processing fTOl, f35l. Our goal is to 
combine such models with non-Gaussian distributions, e.g., 
MGGD. Recently, a similar problem was addressed using 
an expectation maximization technique fTSl. Another line of 
work focused on combining Tyler's scatter estimate with a 
banded inverse covariance prior fS), IT]. Graphical models 
for transelliptical distributions were discussed in f2T). In 
this paper, we combine the MGGD framework with prior 
sparsity constraints on the inverse covariance. We show that 
the optimization can be formulated into a convex form as 
long as the MGGD scale parameter is larger than half and 
the sparsity satisfies a chordal structure. Chordal models, 
also known as decomposable or triangulated models, include 
banded structures, multiscale settings and other practical sce- 
narios 1341 . lfT6l . lfT2]| . Such structures are associated with a 
perfect ordering of the variables. A typical example is banded 
models associated with time-varying autoregressive processes 
0. 

Our results are also applicable to the case of unknown 
sparsity pattern, i.e., structure learning via sparsity inducing 
penalties, but require prior knowledge of the perfect order 
Recent works on structure learning in directed acyclic graphs 
provide data driven techniques for learning this order |30|, 

m. 

The outline of the paper is as follows. We begin in Section 
In] with a few mathematical preliminaries that will be useful in 
our work. Then, we continue to our two main contributions. 



In Section III we provide a new geodesic analysis of MGGD 



estimation, and in Section IV we introduce a convex opti- 
mization framework for chordal structured MGGD estimation. 



Simulation results are described in Section |VJ and concluding 
remarks are offered in Section IVTl 

We use the following notations. We denote the set of real, 
symmetric and positive definite matrices by S+_|_(p) C M^^^. 
We denote the span operator by sp{}. 

II. Preliminaries 

We begin with a brief review of two mathematical concepts 
which will be instrumental in the next sections. 

A. Geodesic convexity 

Geodesic convexity is an extension of classical convexity 
which replaces lines with geodesic paths in manifolds. More 
details on this topic can be found in |27|. Given a Riemannian 
manifold Ai and a set ^ C Al, we say a function f : A ^ 
M is geodesically convex, if every geodesic '^^y of Ai with 
endpoints x,y E A (i.e., j^y is a function from [0,1] to 7W 
with 7x1/(0) = X and jxyi^) = v) lies in A, and 



ni.y{t))<(l-t).f{x) + tf{y) 
for any x,y E A and < t < 1. 



(1) 



If the inequality in ([TJ is replaced by strict inequality, we 
call the function / geodesically strictly convex. An equivalent 
definition follows from ll22l Theorem 1.1.4]. 

Proposition 1: For continuous function /, the definition in 
([TJ is equivalent to the condition 



f{i.ym<y{x) + y{y) 

for any x,y E A. 



(2) 



The importance of geodesic convexity stems from the fol- 
lowing properties (see EtI Theorem 2.1] for more details). 

Proposition 2: Any local minimizer of a geodesically con- 
vex function is also its global minimizer. 

Proposition 3: Any strictly geodesically convex function 
has a unique global minimizer. 

In particular, we consider geodesic convexity on the man- 
ifold of positive definite matrices denoted by S++(p). The 
geodesic connecting Si E S++(p) and S2 E S++(p) is 
defined as Q Chapter 6] 



Ts.s.W^Sf (Si' 



^2^^! 



-'1 • 



(3) 



Geodesic convexity of Tyler's likelihood has been identified 
in 0, ||32l, lES, 1361. In Section |III] we continue this Hne of 
works and show that the MGGD likelihood is also geodesically 



B. Chordal graphs 

In statistical graphical modeling, graphs are used to char- 
acterize the sparsity pattern of the corresponding inverse co- 
variance matrices. When the pattern belongs to a special class 
known as chordal graphs, these concentration matrices satisfy 



an appealing structure which will be exploited in Section IV 



A graph G{V, E) is chordal if every cycle of length > 4 
has an edge joining two nonconsecutive vertices of the cycle. 
For a chordal graph, there is a perfect elimination ordering of 
vertices, (ui,W2,-'' I'^n) such that for any 1 < i < n, the 
neighbor of Vi, Adj(fi) — {u E V : {u,Vi) E E), satisfies 
that Ad]{vi) n {ui+i, Wi-i-2, • • • , v,i} induces a fully connected 
clique, i.e., a set of fully connected nodes. 

It is convenient to define the sparsity pattern of a square 
n X 71 matrix C via a graph G{V, E) with n vertices. We say 
that C is G-sparse if 



[C],, =0, for all {1,3) iE. 



(4) 



More details on chordal graphs and their relation to graphical 
models can be found in ll20l. 041. lfT6l. ifTH. 



A Cholesky decomposition is a generalization of the squared 
root operation to positive definite matrices. Any S E S++(p) 
has a unique decomposition S = CC'^ where the Cholesky 
factor C is a lower-triangular matrix with positive diagonal 
elements. 

The following result characterizes the relation between 
sparse Cholesky decompositions and chordal graphs. 

Proposition 4: Let G(y^ E) be a chordal graph with a 
naturafl perfect order, i.e., Vi — i for i — 1, • • • ,n. If S is 
G-sparse, real and positive definite, then there exists a unique 
G-sparse lower triangular C with positive definite diagonal 
entries such that S = CC^ . If C is G-sparse and lower- 
triangular, then CC^ is G-sparse and positive definite. 
The existence proof can be found in (W, Section 2.1], and the 
uniqueness is a direct consequence of the perfect elimination 
order. 

Example 1: For a banded matrix with band width d, it 
corresponds to a graph G{V, E) defined as follows: {i,j) E E 
when |j — j| < d. Now we check that this is a chordal graph: 
for every cycle with length > 4, the index of the four nodes 
differ d at most, therefore the edges connect all nodes in the 
cycle, which corresponds to the definition of chordal graph. 

The elimination order for this chordal graph turns out to 
be the natural order Vi = i. Therefore Proposition |4] shows 
that the Cholesky decomposition of a banded matrix S is 
the product of a banded (with the same bandwidth d) lower- 
triangular matrix with its transpose. 

One can also easily show that the product of a banded lower- 
triangular matrix with its transpose is still a banded (with the 
same band width), positive definite matrix, which exemplifies 
the last sentence in Proposition |4] 

For more examples of Chordal graph and its associated perfect 
order, we recommend the examples and graphs in |i35i Section 
II. A] and ^ Section 3]. 

III. Geodesic convexity in MGGD 

In this section, we consider unconstrained MLE in MGGD. 
More precisely, we address a more general family of ellip- 
tically symmetric (ES) distributions (see fTTj for a review 
and |24| for a recent generalization to complex case). These 
problems involve non-convex minimizations, yet it has been 
shown that their global minima can be efficiently found using 
simple fixed point iterations. We will show that the negative 

' If the order is not natural, this result holds by permuting the columns of 
the matrix. 



log-likelihoods are in fact geodesically convex, and that this 
may be the underlying principle behind their success. 

An random variable z E W^ has a ES distribution in real 
space if its probability density function (p.d.f.) is 



/(^) 



^p,gl^l 



-"■'9{{^ 



Mf S-i(2-Ai)), (5) 



where g : [0,oo) — > (0,oo) such that J^ tP^^g{t'^) dt < oo, 
Cp,g is a normalization constant such that the integral of the 
distribution is 1. In (|5]), S e S++(|5) is called a scatter matrix, 
and /i e M^ is the center of the distribution. 

MGGD ||23]| is a widely used special case of ES when 



gix) = exp {-x^/2) 



(6) 



where /3 is the shape parameter In particular, for f3 = 0.5 
it gives multivariate analog of Laplace distribution, and the 
multivariate Gaussian distribution is obtained for /3 = 1. 

We consider the estimation of S given n independent 
and identically distributed (i.i.d) realizations of a zero mean 
ES random vector denoted by Zi, • • • ,Zn- The MLE is the 
parameter that minimizes the negative-log-likelihood 



io(S) 



1=1 



logdet(S), 



where p{x) = —\ogg{x). The following Theorem [T] charac- 
terizes the existence and uniqueness properties of this MLE. 
Theorem flta) characterize the uniqueness and is the main 
contribution in this section; Theorem [TJb) characterize the 
existence and is borrowed from ||3T1 Theorem 2.1]. 

Theorem 1: Assume that p{x) is continuous in (0, oo), 
nondecreasing and p(e^) is convex, then £o(S) is geodesically 
convex in S++(p). In addition, if s^p{zl^Z2^ ■ ■ ■ ,Zn} = M.P 
and for p{e^) is strictly convex, then Lo(S) is geodesically 
strictly convex in S+-|_(|5). 

(b) If p{x) is continuous in [0, oo), ai = 
sup{a|a;°/^ exp(— p(a;)) -^ as x — > oo} is positive, 
and X = {z,}f^^. 



l-Ynvi 



< 1 



p — dim(V) 



fli 



for any linear subspace V G 



(7) 
then there exists a minimizer of Lo(S). 

Before proving this theorem, a few comments are in order 
We remark that the condition spj^i, Z2,- ■ ■ , 2;„} = M^ is also 
implicated by (|7|, since otherwise V = sp{zi, Z2,- ■ ■ ,Zn} 
violated the assumption. This condition is necessary since 
otherwise io(-fV) — > — oo as S — > Py, where Py is the 
projector to subspace V) and therefore the minimizer of LqCE) 
does not exist. 

Theorem [T] relaxes the conditions for the unique- 
ness/existence of the minimizer of io(S) presented in |fT9l 
Theorem 2.2]: 

Ml p'(x) is non-negative, continuous and nonincreasing. 
M2 xp'{x) is strictly increasing. 
M3 condition ^. 

Our assumption that p is nondecreasing correspond to Ml, 
where p' is nonnegative; and our assumption that p(e^) is 
convex/strictly convex corresponds to M2, where xp'{x) is 



nondecreasing/strictly increasing. In comparison, our condi- 
tions does not require p to be differentiable, and we do not 
assume their assumptions in Ml that p'{x) is continuous or 
nonincreasing. 

A special case of Theorem [T] is Tyler's M-estimator in which 
p{x) — \og{x)/2p. Geodesic convexity of this estimator has 
been previously identified in 16], lf32l . ll33l . 1361 . Another 
special case is the class of MGGD estimators. The following 
corollary follows from Theorem [T] and the fact that ai in 
Theorem [TJl)) is oo. 

Corollary 1: For all /? > 0, the ML estimator for MGGD 
exists and unique if spj^i, Z2, ■ ■ ■ , z„} — W. 

Existence and uniqueness of the MLE in MGGD has been 
previously addressed in [24, Section V.A]. However, this 
contribution does not identify geodesic convexity and applies 
only to < /3 < 1. Our theorem applies to MGGD with 
all /3 > and provides additional insight based on geodesic 
convexity. Furthermore, it applies to other ES distributions, 
including cases where p'{x) is not continuous, e.g., p{x) — x 
when a; < 1 and p{x) = 3a; — 2 when x > 1. 

Here we remark that although ML estimator for elliptically 
symmetric (ES) distribution is the motivation of the argument. 
Theorem [T] (and Theorem [2] in next section) hold even if p{x) 
does not relate to an elliptically symmetric (ES) distribution. 
For example, Tyler's M-estimator can be written in the form 
of (|5]l with p{x) — log(x)/2p; however this p{x) corresponds 
to the central angular Gaussian which is not a member in 
ES class lEH (36)]. Another example is in ^ page 5610, 
Example 1], where p' is set to be a huber function. 

We remark that the Theorem [T] also applied to the complex 
elliptically symmetric (CES) distributions defined by 



f{z) - Cp^glSr^g {z - pf S-i [z - p) 



(8) 
where the MLE estimator minimizes 

n 

Lo(S) = ^p(2f S-i^,) +nlogdet(S). 

i=l 

Then Theorem fl] holds with ai = sup{a|x° exp(— p(a;)) — ?> 
as z — > oo}. This is also a generalization of the unique- 
ness/existence of the minimizer of £o(S) presented in ll24l 
Section V.A]. 

Proof: (a) Applying |33, Proposition 1], if S3 is the 
geodesic mean of Si and S2 defined in ([3]), then 

In {z^Yl^^z) + In {z'^Y.^^z) > 2 In {z^H^^z) . 

Combining this fact with the convex/monotone properties of 
p, we have 

p (zf-E^'zi) + p {zf-E^'z^ (9) 

>2p (exp ((In (zf-E^'z,) + In (zf-E^'z,)) /2)) 
>2p (exp (In (2f S3-I2,))) = 2p (zfE^'z,) , (10) 

Since 

logdet (Si) + logdet (S2) = 21ogdet (S3) , (11) 

we proved 

Lo(Si) + io(S2)>2Lo(S3) (12) 



By Proposition [11 we proved the geodesic convexity of La{Ti). 

m 

In practice, various numerical techniques can be used to 
find a local minima of the MGGD negative log-likelihood. 
Theorem [T] and geodesic convexity then ensure that this local 
minima will also be global minima. A promising approach is 
the classical iterative reweighed scheme due to |fT9l . 0: 



n 



S™+i = 2^u{z^ S,„ z,)z,z^ /n, 
1=1 



(13) 



where u{x) — p'{x). It has been shown in ||5] Proposition 
1] that when Sp(2i, 2:2, • • • ,2„) = MP, p"{x) is continu- 
ous and nonnegative, then io(Sm) decreases monotonically 
(Sp(zi, Z2, • • • , Zn) = M.P is assumed in order to make sure 
that Lo(Sm+i) is well-defined). |5, Proposition 3] shows that 
any limiting point of the sequence S„j is a stationary point. 
When the assumptions in Theorem [T] hold, this point is the 
unique minimizer of io(I]). 

IV. Convexity in chordal MGGD 

In this section, we consider structured MGGD estimation. 
In particular, we consider MLE of the MGGD scatter matrix 
subject to sparsity constraints. Thus, we are interested in the 
solution to 

mins X]"=i P i^T'S^^Zi) + nlogdet(S) 
s.t. [S-^]„. =0, {i,j)(^E 



(14) 



where the objective is the negative MGGD log-likelihood 
as described in Section III and E is the edge set of a 
known graph G{V,E) associated with the sparsity of S^^. 
Unfortunately, the above minimization is not convex in S or 
S^^. Nor is it geodesically convex in it. Indeed, the sparsity 
constraints are not preserved by the positive definite geodesic 
in (|3]l. Thus, it is not clear whether its global minima can be 
found in an efficient manner. 

In what follows, we propose a simple trick to "convexify" 
the optimization in many interesting cases of MGGD. In 
particular, we assume the G is chordal and represent S^^ 
with its Cholesky factor C. Due to Proposition HI a unique 
chordal decomposition exists and we obtain 



mincEi; ElLi P {\\C^ ^^r) - ^n^U ^""^^^^^ 



s.t. 
where 



[C]„. =0, (z,j)^i?, 



y = <ce 



^pxp 



and we have used 



nlogdet (CC 



a,, > 

Ci,j ~ for all i < j 



-2n^log(C,-,). 



(15) 



(16) 



(17) 



The following theorem characterizes the properties of this 
optimization. 

Theorem 2: (a) Assume that p{x) is continuous in (0, 00), 
nondecreasing and p{x^) is convex, the objective of (15 1 
is strictly convex, (b) When spj^i, 2:2, • • • ,-z„} = MP7a 
minimizer to ([TSj exists. 



Proof: Negative logarithms are strictly convex, and it 
remains to show that 



C1+C2 



< 



\Cjz, 



1^2 Zi 



<\p{\\Clz,f) + \p{\\C^,z.,f) 



(18) 



where we have used the triangle inequality and the convexity 

ofp(a;2). 

To prove the existence of minimizer, we need to prove 
that for L,{C) = Eti P {\\C^ ^^\\') - ^^EU^ogiC,,,), 
Li{C) -> 00 as ||C||f — > 00 or the smallest eigenvalue of C 
goes to 0. When ||C||f -^ 00, since sp{zi, Z2, • • • , ^n} ~ K^ 
and C is full rank, there exists C2 such that X]"=i llC'^Zijp > 
C2||C'|||.. Consider that p{x'^) is convex, there exists a constant 
Ci and Ci such that when - X^iLi llC'^^-ZilP > Ci, 



Y.pi\\c' 



> np{- 



A^ 



\C^z,f) 



>nci 



\ 



1 " 

-V||C^2j2>7^CiC2|lC|lf. 
n ^ — ^ 



Therefore as ||C||i? — > 00, Li{C) —>■ 00. 

When the smallest eigenvalue of C goes to 0, we have 
dct(C) -^ and Ei=i log(Cjj) = logdct(C) -^ -cx3. In 
another aspect, since p{x'^) is convex, liminfa;_j.o /o(a;) exists 
(by convexity it is larger than 2p(l) — p(4)). Combining it 
with the fact that p{x) is nondecreasing, Li{C) — ?► cx) as the 
smallest eigenvalue of C goes to 0, and the existence of the 
minimizer of Li{C) is proved. ■ 

The theorem shows that, under suitable conditions, any local 



minimizer of ( 15 1 is globally optimal. This holds for any ES 
distribution with p satisfying the assumptions. In particular, an 
important special case is chordal graphical models in MGGD: 

Corollary 2: If sp{zi,Z2,- ■ ■ , Zn} = W, under chordal 
graphical models the ML estimator for MGGD exists and 
unique for all (3 > 0.5. 

Unfortunately, the recent result in |2| on p(x) = log(a;)/2p 
does not satisfy our conditions and requires a different analy- 
sis. 

When the condition in Theorem |2] holds, the objective 
function is convex with respect to C. Therefore, it can be 
numerically minimized via any general convex optimization 
solver. For example, in the MGGD case with /3 > 0.5 the 
problem can be expressed as 

^^cey,t Er=i \U\'^ - ^nEU l°g(<^^-.^-) 

s.t. ti > \\Cz,\\ (19) 

where t are auxiliary variables. This formulation with second 
order cone constraints can be easily solved using the popular 
CVX package HT]. 



Alternatively, the optimization can be addressed using a 
majorization - minimization (MM) technique, e.g. |5|. The 
method begins with an initial estimate Sq G S++(p) and 
updates it according the following iterations 

- r mills /i(S,S„) 

^™+i = ^'^g ( s.t. [S-i]^^.=0, {^.J)iE (20) 

where 

n 

/^(E, S„,) = Y. u{zf-E;^^z,)zf-E-^z, + nlogdet(S) + C, 

1=1 

(21) 

u{x) = p'{x) and C is chosen such that 

/i(Sm, S„i) = Lo(Sm). Since p"(a;) is continuous and 

nonnegative, we have 



Loi'^m+l) > ^(5],Sm) > /i(S.,„,Sm) — Lo{'Sr> 



(22) 



and Lo(Sm) converges as m — > oo. Each iteration step in 
([20ll-([2T|i can be interpreted as a Gaussian graphical model 
optimization (the weights u(-) are constant with respect to 
the optimization variable). These minimizations have a simple 
closed form solution when G{V, E) is chordal (see appendix). 
Thus, the proposed technique is very efficient for implemen- 
tation in practice. 

Finally, it is worth mentioning that our framework can 
also be extended to structure learning in MGGD graphical 
models. Structure learning, also known as covariance selection, 
considers the estimation of an inverse covariance which is 
known to be sparse but the sparsity pattern itself is unknown 
11131, fSl, f29l. Adding a sparsity constraint usually destroys 
the convexity. Instead, the modern approach relies on a convex 
relaxation based on an LI norm penalty. In our context, the 
problem is convex in the Cholesky factor rather than in the 
inverse covariance itself. This leads to the following convex 
minimization 



minVp(||C^,||2)-2n^log([C],-,) 

i=l 3=1 



A||C||i (23) 



where A is a regularization parameter, and ||C||i is a matrix 
version of the LI norm, namely a sum over the absolute values 
of the elements in C It is important to emphasize that this 
approach is only applicable to chordal graphical models and 
it assumes that the perfect order of the variables is known a 
priori. Recent developments in high dimensional covariance 
estimation provide data-driven methods for identifying this 
order and structure ||30l . ||28l . We leave this topic as a possible 
direction for future research. 

Furthermore, our methodology can be easily extended to 
joint estimation of the covariance and the centering parameter 
/i. For this purpose, consider the optimization of 

mincej^,M ELi P (llC^^;, - xf) - 2nYJ'^^^ log(Cjj) 
s.t. [C],j-=0, {i,j)iE 

(24) 

where x = C^ fi When the conditions in Theorem |2[ a) hold, 
the objective function is jointly convex with respect to (C, x), 
therefore any of its local minimizer is also its global minimizer 
Its algorithm can be similarly addressed using a majorization 
- minimization (MM) technique as in (|20]i-(pT|. 



V. Numerical results 

In this section, we present results of numerical simulations. 
The purpose of these simulations is three fold: to validate our 
theoretical results using synthetic data, to provide insight on 
a few open theoretical questions using synthetic data and to 
demonstrate the usefulness of our theoretical models in a real 
world setting. 

The first experiment considered inverse covariance estima- 
tion using synthetic data generated from a known MGGD 
distribution with /3 > 0.5 and a chordal structure. The theory 
in Section |IV] shows that, in this case, the MLE can be 
formulated as a convex optimization. To verify this result, 
we generated n i.i.d. realizations of an MGGD with p — 10, 
/3 — 0.5 (corresponding to a classical multivariate Laplacian 
distribution), and a banded inverse covariance of width b — A. 
In particular, we used a Toeplitz inverse covariance with 1.0 
diagonal elements and 0.4 off-diagonal elements within the 
main band. We estimated the unknown covariance using five 
estimators: 

• G: A Gaussian MLE with no prior knowledge of the 
structure, corresponding to the classical sample covari- 
ance. 

• BG: A Gaussian banded MLE as detailed in the appendix. 
. MGGD: An MGGD MLE with no prior knowledge of 

the structure via 30 iterations of ( fT3] ). 
. BMGGDl: An MGGD banded MLE via 30 iterations of 



(20 1 and the subroutine in the appendix. This estimator 
is initialized with S = I. 
. BMGGD2: An MGGD banded MLE via 30 iterations of 
( [20| and the subroutine in the appendix. This estimator 
is initialized with S = S (which is clearly impossible in 
practice). 

In all estimators above we use 30 as the number of iterations 
since in our simulations the fixed point algorithms ([T3| and 
( pOl converge well after 30 iterations. 

Figure 1 shows the normalized mean squared Frobenius 
error in the covariance averaged over 10000 independent 
simulations as a function of the number of samples n. It 
is easy to see the performance advantage of banded MGGD 
estimator. As expected, BMGGDl and BMGGD2 converge to 
the identical fixed points irrespective of their initial condition. 

The second experiment is very similar to the first except 
that /3 — 0.2. Our theory assumes /3 > 0.5 and therefore does 
not hold for this value of shape parameter Yet, according 
to Q banded Tyler estimators do converge to unique fixed 
points, and these can be interpreted as the limit of MGGD 
when /? — > 0. Our proposed fixed point iteration in (20 1 can 



be implemented with a small /3. Therefore, it is interesting to 
examine its performance and we repeat the first experiment 
with /3 = 0.2. The results are presented in Figure 2. Interest- 
ingly, BMGGDl and BMGGD2 converged to identical fixed 
points irrespective of their initial condition. Thus, although we 
have no proof for this behavior, we conjecture that, in practice, 
the iteration can also be used for all MGGDs. 

The third experiment focused on non-chordal structures. Our 
theory assumes a chordal sparsity pattern and it is interesting 
to see whether this assumption is indeed critical. Thus, we 



repeated the first experiment associated with MGGD (3 = 0.5, 
but replaced the banded inverse covariance with a loopy 
structure. In particular, we constructed a two dimensional grid 
graph of size p — i"^ with the edges (3, 1), (5, 1), (6, 1), 
(7,1), (8,1), (9,1), (4,2), (6,2), (7,2), (8,2), (9,2), (4,3), 
(5,3), (7,3), (8,3), (9,3), (6,4), (8,4), (9,4), (7,5), (9,5), 
(7, 6), (8, 6), and (9, 7). All the non-zero off diagonal elements 
were assigned a constant value small enough to ensure that 
S~^ will be well conditioned. In contrast to the chordal 
case, MLE in general Gaussian Graphical models in does not 



satisfy a closed form solution. Instead, we solved ( 14 1 using 
the CVX optimization package |17|. The latter was used for 
implementing BG, and for the inner solution in each iteration 
of BMGGDl and BMGGD2. The results averaged over 400 
simulations^ are provided in Figure 3. Here too, the iterations 
converged to identical solutions and seem to be independent 
of their initial conditions. 

The fourth experiment addressed the practical use of the 
BMGGDl estimator in a real world example. Following ||29| 
we considered the SONAR dataset from the UCI machine 
learning data repository. This dataset has 111 spectra from 
metal cyUnders and 97 spectra from rocks, where each spec- 
trum has 60 frequency band energy measurements. Quadratic 
discriminant test is used to classify the metals and the rocks. 
It requires the estimation of the covariance in both classes, 
and previous work demonstrated the advantage of BG over 
the classical sample covariance and the naive diagonal es- 
timate. We repeated the experiment step by step and added 
BMGGDl. Specifically, we chose /3 as the parameter within 
{0.5, 0.6, • • • , 0.9, 1.0} that maximizes the MGGD likelihood, 
and used the same band which was used by BG (selected via 
10 random splits with 1/3 of the data for training and the 
validation likelihood). For the QDA test we applied the co- 
variance of MGGD, which is c(/3)S, where c(^) = 2^ ^-^X^. 
The test errors over a standard leave-one-out cross validation 
are provided in Table I. These results remained stable over 
different randomizations, and demonstrate the advantage of 
the proposed BMGGDl framework. 




Fig. 1. MGGD with , 
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0.5 and a banded inverse covariance . 
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Fig. 2. MGGD with /3 = 0.2 and a banded inverse covariance . 



TABLE I 

Test errors in SONAR dataset. 



Sample covariance 


Naive Bayes 


BG 


BMGGDl 


24.0% 


32.7% 


15.4% 


13.5% 



VI. Discussion 

In this paper, we consider covariance estimation in the 
multivariate generaUzed Gaussian distribution (MGGD). We 
proved that the MGGD negative log-likelihood is geodesically 
convex for /? > 0. In the sparsely constrained case, we 
proved that a simple change of variables can transform it into 
a convex function as long as /3 > 0.5 and the underlying 
graph is chordal. This means that any local solution of 
these minimization is globally optimal and the problems can 
be solved using standard descent methods. In practice, we 




Number of samples 



Due to CVX these simulations are highly time consuming. 



Fig. 3. MGGD with /3 = 0.5 and a non-chordal graphical model. 



observed this behavior also for smaller values of /3. This agrees 
with a similar result on banded Tyler methods which can be 
interpreted as /3 — ?► 0. An interesting direction for future work 
is a rigorous analysis of these phenomenon when < /3 < 0.5. 
Another direction is relaxing the chordal assumption on the 
sparsity pattern. Here too, our numerical experience suggests 
that simple descent methods converge to the global solution 
and are independent of their initial conditions. Finally, as 
mentioned above, it in interesting to examine the problem of 
structure learning in MGGDs via sparsity enforcing priors. 

Appendix 

In this appendix, we review a simple closed form solution 
for the MLE in chordal (decomposable) Gaussian graphical 
models ED], Ol, El- The problem is 



mins 

s.t. 



YA^iOiizfT. i^i +nlogdetS 



[S- 



0, {i,j)iE 



where 



U{Z^ S,„ 2. 



(25) 



(26) 



(27) 



are the iteration weights which do not depend on S. Using the 
chordal Cholesky decomposition S^^ — CC^ , the problem 
is equivalent to 

mine ELi "dlC^^J' - 2nE^.i log([C],, 
s.t. [C],^. =0, ii,j)iEoii>j 

This minimization is completely separable and each column 
of C can be simply obtained by a linear regression. Let 
{Zi}^_^ C M" be vectors consists of the i-th components of 
y/alzi,y/a^Z2, • • • , y/a^iZn, Ji = {i + I < k < p : {i, k) e 
E}, and assume that in the linear regression of Zi with respect 
to {Zj}j^j., the parameter on Zj is f3{i,j), and standard error 
is r,. Then 



[Ah - 

D = 
and finally C = 



[1 



1, when i — j 

I3{j,i), when i > j and {i,j) e E (28) 
0, otherwise. 

diag(ri,r2,--- ,rp), (29) 



DA. 
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